Add v6e TPU Head Resource Autoscaling Support #48201
Conversation
Signed-off-by: Ryan O'Leary <[email protected]>
This PR was manually tested as follows:
LGTM, just a minor nit
Signed-off-by: Ryan O'Leary <[email protected]>
@kevin85421 @hongchaodeng pinging this again to see if it can get reviewed/approved by a code-owner. With v6e TPUs in private preview in GKE, it would be good to ensure the Ray autoscaler support is there.
```python
default_num_cores_per_chip = 2
if generation == "v5e":
    default_num_cores_per_chip = 1
```
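For context, here is a minimal sketch of how the surrounding helper could derive a TPU type from the GKE node selectors, assuming a `(topology, accelerator)` signature; only the cores-per-chip branch above is taken from the diff, and the `v6e` branch reflects the change this PR proposes:

```python
from functools import reduce
from typing import Optional


def tpu_node_selectors_to_type(topology: str, accelerator: str) -> Optional[str]:
    """Sketch: map GKE TPU node selectors to a Ray TPU type string, e.g. "v6e-8"."""
    if not (topology and accelerator):
        return None
    # "tpu-v6e-slice" -> "v6e"
    generation = accelerator.split("-")[1]
    # "2x4" -> 8 chips in the slice
    num_chips = reduce(lambda a, b: a * b, (int(dim) for dim in topology.split("x")))
    # v4 chips expose 2 TensorCores each; v5e and v6e chips expose 1.
    default_num_cores_per_chip = 2
    if generation in ("v5e", "v6e"):
        default_num_cores_per_chip = 1
    return f"{generation}-{num_chips * default_num_cores_per_chip}"
```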
How can I determine the exact value of `default_num_cores_per_chip` to verify that this logic is correct? I briefly used Ctrl + F to search for some keywords in https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#run, but I couldn't find the information.
The mapping between chips and cores for each TPU generation is viewable in one spot in the Cloud TPU documentation for each version: under System Architecture, it states the number of TensorCores per TPU chip.
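For example, a v4 slice with topology 2x2x2 has 8 chips × 2 TensorCores per chip = 16 cores, so the derived type string would be v4-16, while a v6e slice with topology 2x4 has 8 chips × 1 TensorCore per chip = 8 cores, giving v6e-8.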
I have already pinged my colleague to merge this PR.
Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: hjiang <[email protected]>
Why are these changes needed?
This PR adds `tpu-v6e-slice` to the list of known TPU accelerators, enabling the KubeRay autoscaler to add a `TPU-v6e...-Head` resource to the autoscaling resource config. This PR also adds unit test coverage in `test_tpu_node_selectors_to_type`.
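As a rough usage sketch (the selector values below are illustrative, not taken from this PR):

```python
# Illustrative GKE node selectors for a v6e slice; the topology value is an
# assumption for this example, not from the PR.
node_selectors = {
    "cloud.google.com/gke-tpu-accelerator": "tpu-v6e-slice",
    "cloud.google.com/gke-tpu-topology": "2x4",
}

# With tpu-v6e-slice recognized, the autoscaler can derive a type such as "v6e-8"
# and add a corresponding "TPU-v6e-8-Head" entry to the autoscaling resource config.
```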
Related issue number
Checks

- I've signed off every commit (using `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.